89 research outputs found

Efficient decoding algorithms for generalized hidden Markov model gene finders

Author: Delcher Arthur L
Majoros William H
Pertea Mihaela
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity. RESULTS: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN. CONCLUSIONS: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Minimus: a fast, lightweight genome assembler

Author: Delcher Arthur L
Pop Mihai
Salzberg Steven L
Sommer Daniel D
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. RESULTS: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. CONCLUSION: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge

Crossref

Springer - Publisher Connector

PubMed Central

Digital Repository at the University of Maryland

A unified model explaining the offsets of overlapping and near-overlapping prokaryotic genes.

Author: Delcher Arthur L.
Kingsford Carl
Salzberg Steven L.
Publication venue: Molecular Biology and Evolution
Publication date: 01/01/2007

Overlapping genes are a common phenomenon. Among sequenced prokaryotes, more than 29% of all annotated genes overlap at least 1 of their 2 flanking genes. We present a unified model for the creation and repair of overlaps among adjacent genes where the 3' ends either overlap or nearly overlap. Our model, derived from a comprehensive analysis of complete prokaryotic genomes in GenBank, explains the nonuniform distribution of the lengths of such overlap regions far more simply than previously proposed models. Specifically, we explain the distribution of overlap lengths based on random extensions of genes to the next occurring downstream stop codon. Our model also provides an explanation for a newly observed (here) pattern in the distribution of the separation distances of closely spaced nonoverlapping genes. We provide evidence that the newly described biased distribution of separation distances is driven by the same phenomenon that creates the uneven distribution of overlap lengths. This suggests a dynamic picture of continual overlap creation and elimination.

PubMed Central

Digital Repository at the University of Maryland

Versatile and open software for comparing large genomes

Author: Antonescu Corina
Delcher Arthur L
Kurtz Stefan
Phillippy Adam
Salzberg Steven L
Shumway Martin
Smoot Michael
Publication venue: BioMed Central
Publication date: 01/01/2004

The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes. Two new graphical viewing tools provide alternative ways to analyze genome alignments. The new system is the first version of MUMmer to be released as open-source software. This allows other developers to contribute to the code base and freely redistribute the code. The MUMmer sources are available at

Springer - Publisher Connector

PubMed Central

Digital Repository at the University of Maryland

Core Gene Set As the Basis of Multilocus Sequence Analysis of the Subclass Actinobacteridae

Author: Adékambi Toïdi
Butler Ray W.
Delcher Arthur L.
Drancourt Michel
Hanrahan Finnian
Shinnick Thomas M.
Publication venue: Public Library of Science
Publication date: 31/03/2011

Comparative genomic sequencing is shedding new light on bacterial identification, taxonomy and phylogeny. An in silico assessment of a core gene set necessary for cellular functioning was made to determine a consensus set of genes that would be useful for the identification, taxonomy and phylogeny of the species belonging to the subclass Actinobacteridae, which contained two orders, Actinomycetales and Bifidobacteriales. The subclass Actinobacteridae comprised about 85% of the actinobacteria families. The following recommended criteria were used to establish a comprehensive gene set: the gene should (i) be long enough to contain phylogenetically useful information, (ii) not be subject to horizontal gene transfer, (iii) be a single copy, (iv) have at least two regions sufficiently conserved to allow the design of amplification and sequencing primers, and (v) predict whole-genome relationships. We applied these constraints to 50 different Actinobacteridae genomes and made 1,224 pairwise comparisons of the genome conserved regions and gene fragments obtained by using the Sequence VARiability Analysis Program (SVARAP), which allows the primers to be designed. Following a comparative statistical modeling phase, 3 gene fragments were selected, ychF, rpoB, and secY, with R² > 0.85. Selected sets of broad range primers were tested from the 3 gene fragments and were demonstrated to be useful for amplification and sequencing of 25 species belonging to 9 genera of Actinobacteridae. The intraspecies similarities were 96.3–100% for ychF, 97.8–100% for rpoB and 96.9–100% for secY among 73 strains belonging to 15 species of the subclass Actinobacteridae, compared to 99.4–100% for 16S rRNA. The phylogenetic topology obtained from the combined datasets ychF+rpoB+secY was globally similar to that inferred from the 16S rRNA but with higher confidence. It was concluded that multi-locus sequence analysis using a core gene set might represent the first consensus and valid approach for investigating bacterial identification, phylogeny and taxonomy.

Public Library of Science (PLOS)

PubMed Central

Correction: Serendipitous discovery of Wolbachia genomes in multiple Drosophila species

Author: Salzberg Steven L
Dunning Hotopp Julie C
Delcher Arthur L
Pop Mihai
Smith Douglas R
Eisen Michael B
Nelson William C
Publication venue: BioMed Central
Publication date: 24/06/2005

A correction to Serendipitous discovery of Wolbachia genomes in multiple Drosophila species by SL Salzberg, JC Dunning Hotopp, AL Delcher, M Pop, DR Smith, MB Eisen and WC Nelson. Genome Biology 2005, 6:R2

Crossref

PubMed Central

Caltech Authors

Serendipitous discovery of Wolbachia genomes in multiple Drosophila species

Author: Delcher Arthur L
Eisen Michael B
Hotopp Julie C Dunning
Nelson William C
Pop Mihai
Salzberg Steven L
Smith Douglas R
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: The Trace Archive is a repository for the raw, unanalyzed data generated by large-scale genome sequencing projects. The existence of this data offers scientists the possibility of discovering additional genomic sequences beyond those originally sequenced. In particular, if the source DNA for a sequencing project came from a species that was colonized by another organism, then the project may yield substantial amounts of genomic DNA, including near-complete genomes, from the symbiotic or parasitic organism. RESULTS: By searching the publicly available repository of DNA sequencing trace data, we discovered three new species of the bacterial endosymbiont Wolbachia pipientis in three different species of fruit fly: Drosophila ananassae, D. simulans, and D. mojavensis. We extracted all sequences with partial matches to a previously sequenced Wolbachia strain and assembled those sequences using customized software. For one of the three new species, the data recovered were sufficient to produce an assembly that covers more than 95% of the genome; for a second species the data produce the equivalent of a 'light shotgun' sampling of the genome, covering an estimated 75-80% of the genome; and for the third species the data cover approximately 6-7% of the genome. CONCLUSIONS: The results of this study reveal an unexpected benefit of depositing raw data in a central genome sequence repository: new species can be discovered within this data. The differences between these three new Wolbachia genomes and the previously sequenced strain revealed numerous rearrangements and insertions within each lineage and hundreds of novel genes. The three new genomes, with annotation, have been deposited in GenBank

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Digital Repository at the University of Maryland

High-throughput sequence alignment using Graphics Processing Units

Author: Arthur L Delcher
Amitabh Varshney
Cole Trapnell
Michael C Schatz
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. RESULTS: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. CONCLUSION: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.

CiteSeerX

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering

Author: Arthur L. Delcher
Bo Liu
David R. Kelley
Mihai Pop
Steven L. Salzberg
Publication venue: Oxford University Press
Publication date: 01/11/2013

Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested

Crossref

Harvard University - DASH

PubMed Central